In this notebook we use the following R packages: tidyverse and ggplot2. In addition we use the multiplot functionality provided by Cookbook for R. In Python we use fastai, numpy and pandas. Full code for training and inference available at GitHub. All code for statistical analysis is included in this document.
The success of deep learning models has typically been measured in terms of their predictive power, but they have lacked a principled way of expressing uncertainty about these predictions.
In my master’s thesis we apply recent insights connecting dropout neural networks to Bayesian approximate variational inference (VI). VI is technique for approximating intractable integrals that arise when modelling the posterior distribution in Bayesian inference and machine learning. Gal et. al. [1, 2, 3] have shown that the posterior distribution of a NN can be approximated by leaving dropout on at test time and sampling multiple predictions. This amounts to drawing Monte Carlo (MC) samples from the posterior (a brief description of this process is listed in section 1.1). The Bayesian framework allows us to obtain principled uncertainty estimates when making predictions in these so called MC dropout NNs.
The results of [1,2,3] seem particularly promising when applied to healthcare. In fact, a recent paper published in Nature [4] applies MC dropout to capture uncertainty estimates when predicting the presence of diabetic retinopathy in patients. This paper demonstrates how uncertainty estimates can provide useful information to the clinician tasked with interpreting the results of medical images. The key idea is that this human-machine interaction will lead to overall better results than either could produce individually.
In this notebook I will first briefly introduce the general idea behind MC dropout. Then we will apply MC dropout in a classification task based on the collection of labelled images in the CIFAR-10 data set.
A neural network is made up of many neurons which are organized in layers. Neurons are often called nodes or units, depending on your choice of machine learning literature. Henceforth we will refer to them as units.
In a typical neural network there are many, many units. As a consequence the number of parameters greatly exceeds the number of data points, dramatically increasing the risk of overfitting (i.e. there are many settings of weights which will cause the network to fit the data almost perfectly).
Dropout is a common stochastic regularisation technique [5] that is used to prevent overfitting in neural networks. The term dropout refers to randomly “dropping out” nonoutput units, temporarily removing all connections to the rest of the network. The main idea is that if the presence of other units is unreliable, each unit is forced to learn how to be useful on its own. At test time all units are activated, and the weights of the network are scaled by the dropout rate in order to match the expected output during training.
Recent work by Gal et. al. [1, 2, 3] casts dropout neural networks as approximate Bayesian inference. Their results show that the predictive posterior distribution of a neural network can be approximated by leaving dropout on at test time.
Consider a classification setting, such as in CIFAR-10. Essentially, what happens when applying MC dropout is the following:
Mathematically, model uncertainy is approximated the empirical standard deviation of the predictions for class \(k\), i.e. \[\hat{\sigma}_k = \sqrt{\frac{\sum_{t=1}^T{[p_t(k|X,w) - \hat{\mu}_k}]^2}{T-1}}\] where \[\hat{\mu}_k = \frac{1}{T}\sum_{t=1}^T p_t(k|X,w)\] is the averaged softmax outputs of the predicted class.
Gal et. al. [3] show that the above amounts to drawing Monte Carlo samples from the predictive posterior. Their work demonstrates that applying dropout is effectively the same as defining a Bayesian neural network with a Bernoulli approximating prior over the parameters. Gal has written an excellent blog post that introduces the work and the derivation.
The derivation in Gal et. al. [2] is based on the observation that a shallow neural network with infinitely many weights converges to a Gaussian process (GP) with a specific covariance function. A GP is a non-parametric, probabilistic machine learning method. The idea is that we place a prior distribution over the function space, and by observing new data points we can figure out which function is most likely to have generated the observed data. A GP is fully specified by its mean and covariance functions, and allows us to obtain uncertainty estimates [1]. The extension of this to dropout neural networks is the main result of Gal et. al. [2].
In [1] however, Gal et. al. extend this idea further to include priors over the weights of the convolutional layers in a convolutional neural network (CNN). A CNN is a specialized type of neural network, used primarily and very effectively for image analysis. The authors briefly state that the GP interpretation is lost when the model is extended to CNNs, but that these networks still can be “modelled as Bayesian”.
As far as we are aware, there have been no efforts to determine the correlation between the approximated uncertainty and a network’s ability to predict correctly. Moreover, the work done so far seems to focus on the uncertainty associated with the predicted class. We will examine if there is any additional information to be gained from establishing a connection between the prediction and the runner-up prediction. In other words, our problem statement is:
Are the uncertainty approximations obtained by applying Monte Carlo dropout to convolutional neural networks a reasonable measure of model uncertainty?
State-of-the-art architectures such as ResNet and DenseNet are very powerful, but they are also complicated and their inner workings are quite convoluted. We are primarily interested in examining the correlation between uncertainty estimation and predictive capabilities. It is arguably better to use a simple network architecture to illustrate the idea of MC dropout. In our approach we use LeNet-5. Click on the tabs for implementation details. Full code is available on GitHub.
LeNet-5 was a pioneering 7-layer convolutional neural network originally developed by Yann LeCunn in 1998 for handwritten digit recognition. It is hopelessly primitive compared to contemporary architectures, but still captures the gist of what a convolutional network is while remaining simple enough to allow us to understand every building block of the network. The following chunk shows the model architecture:
class lenet_all(nn.Module):
def __init__(self, conv_size=conv_size, pool_size=2, drop_rate=p):
super().__init__()
self.drop_rate = drop_rate
self.conv1 = nn.Conv2d(in_channels=3, out_channels=192, kernel_size=conv_size)
self.dropmc1 = DropoutMC(p)
self.pool1 = nn.MaxPool2d(kernel_size=pool_size, stride=2)
self.conv2 = nn.Conv2d(in_channels=192, out_channels=192, kernel_size=conv_size, padding=2)
self.dropmc2 = DropoutMC(p)
self.pool2 = nn.MaxPool2d(kernel_size=pool_size, stride=2)
self.dense1 = nn.Linear(in_features=7*7*192, out_features=1000)
self.dropmc3 = DropoutMC(p)
self.dense2 = nn.Linear(in_features=1000, out_features=10)
def forward(self, x):
x = self.dropmc1(self.conv1(x))
x = self.pool1(x)
x = self.dropmc2(self.conv2(x))
x = self.pool2(x)
x = x.view(x.size(0), -1)
x = self.dropmc3(F.relu(self.dense1(x)))
x = self.dense2(x)
return x
Our specification of LeNet-5 differs from the orginial in one crucial way: We use Monte Carlo dropout layers. MC dropout is not a feature that is implemented in PyTorch, and we must therefore implemenet one ourselves. Fortunately, this amounts to a simple adjustment of existing code. We modify the Dropout class to take an additional argument called dropoutMC with default value set to True:
class DropoutMC(nn.Module):
r"""
Modified version of Dropout from torch/nn/modules/dropout.py
Args:
p: probability of an element to be zeroed. Default: 0.5
dropoutMC: If set to ``True``, dropout is turned on at test time. Default: ``True`
inplace: If set to ``True``, will do this operation in-place. Default: ``False``
Shape:
- Input: `Any`. Input can be of any shape
- Output: `Same`. Output is of the same shape as input
Examples::
>>> m = nn.Dropout(p=0.2)
>>> input = autograd.Variable(torch.randn(20, 16))
>>> output = m(input)
.. _Improving neural networks by preventing co-adaptation of feature
detectors: https://arxiv.org/abs/1207.0580
"""
def __init__(self, p=0.5, dropoutMC=True, inplace=False):
super(DropoutMC, self).__init__()
if p < 0 or p > 1:
raise ValueError("dropout probability has to be between 0 and 1, "
"but got {}".format(p))
self.p = p
self.dropoutMC = dropoutMC
self.inplace = inplace
def forward(self, input):
return F.dropout(input, self.p, self.dropoutMC, self.inplace)
def __repr__(self):
inplace_str = ', inplace' if self.inplace else ''
return self.__class__.__name__ + '(' \
+ 'p=' + str(self.p) \
+ inplace_str + ')'
We need to define a function that performs inference over out input. The following chunk contains the relevant code for the sampling procedure described in section 1.1. The function inference stores all the relevant statistics and softmax distributions in a dictionary named output. The results are then turned into a pandas dataframe and some very basic feature engineering is performed. Finally, the data is prepared for statistical analysis in R.
def inference(learner, data, T=100):
''' Function that gathers all relevant numerical results from MC dropout over T iterations.
Arguments:
learner, fastai learner object
data, fastai dataloader
T, number of stochastic forward passes
'''
# Get images, labels and filenames
imgs, labels = next(iter(data.val_dl))
fnames = data.val_ds.fnames
# Empty dictionary to store all output
output = {}
# Empty array to store results in
results = np.empty((T, num_classes))
# iterator index to keep in dictionary
k=0
for (img, label, fname) in list(zip(imgs, labels, fnames)):
for i in range(T):
prediction = learner.predict_array(img[None])
results[i] = prediction
probs = to_np(F.softmax(V(results)))
probs_mean = np.mean(probs, axis=0)
pred_std = np.std(probs, axis=0)
prediction = probs_mean.argmax()
uncertainty = pred_std[prediction]
correct = 1 if prediction == label else 0
output[k] = {"img": fname, "softmax_dist": probs, "probs": probs_mean, "prediction": prediction, "truth": label, "uncertainty": uncertainty, "correct": correct}
k+=1
return output
We will examine data gathered from four variants of LeNet-5. All models were trained on CIFAR-10 using the fastai API. CIFAR-10 contains 60.000 labelled 32x32x3 color images belonging to 10 different classes. The input data was split into a training set of 50.000 images and a test set of 10.000 images. The training set was further split into a training set and a validation set. All models have weight_decay = 0.0005 and all learning rates were chosen using lr_find:
model55: Trained for 60 epochs with a learning rate of 0.001 and kernel size = (5,5) and drop_rate = .5. 0.71280 validation loss at end of training with an accuracy of 0.76137 on the validation data. model55 represents the baseline implementation of LeNet-5. It is identical in structure to the one used by Gal et. al. [1].
model52: Trained for 14 epochs with a learning rate of 0.01 for the first 7 and 0.0001 on the remaining. Changed due to rapid overfitting. Model has kernel size = (5,5) and drop_rate = .2. 0.74186 validation loss at end of training with an accuracy of 0.74406 on the validation data.
model35: Trained for 66 epochs with a learning rate of 0.001 and kernel size = (3,3) and drop_rate = .5. 0.75483 validation loss at end of training with an accuracy of 0.74218 on the validation data.
model32: Trained for 52 epochs with a learning rate of 0.001 and kernel size = (3,3) and drop_rate = .2. 0.75449 validation loss at end of training with an accuracy of 0.74090 on the validation data.
Note that model55 has a slightly better baseline performance than the other models.
The data contains the following variables after it has been prepared for analysis in R:
correct (logical): indicator the is TRUE if the predicted class label matches the true class label, else FALSE.
prediction (int): predicted class label.
truth (int): true class label.
uncertainty (dbl): empirical standard deviation of softmax values for predicted class.
prob1 (dbl): argmax of mean softmax output, i.e. mean probability of predicted class.
prob2 (dbl): mean probability of runner-up prediction.
class2 (int): class label of runner-up prediction.
logit_prob1 (dbl): logit transformation of prob1.
diff (dbl): prob1-prob2
diff_sd_ratio (dbl): diff/uncertainty.
All the variables above are pretty standard, with the exception of diff_sd_ratio. Intuitively, if diff is large, the averaged models all agree that class \(k\) is the correct prediction. If diff is small, the models sampled by MC dropout don’t agree on a single class. Thus diff also serves as a proxy for uncertainty. Model uncertainty, however, is approximated by the empirical standard deviation of the predictions for class \(k\). Thus diff_sd_ratio is expressed by \[\tau_{kj} = \frac{\hat{\mu}_k - \hat{\mu}_j}{\hat{\sigma}_k}\] where \(j\) is the runner-up prediction. \(\tau_{jk}\) gives us a ratio of two different measures of uncertainty.
This section will be divided into to parts. First, we will examine the resulting data from model55, which will be regarded as our baseline model. Next, we will analyse the data from models obtained by varying kernel sizes and dropout rates.
# Importing data
data <- as.tibble(read.csv("~/Documents/Masteroppgave/Data/Resultater/lenet-model55.csv"))
df <- select(data, -X)
# Summarizing entire data set
summary(df)
correct prediction truth uncertainty prob1 prob2
Min. :0.0000 Min. :0.000 Min. :0.0 Min. :0.0000029 Min. :0.1835 Min. :0.0000004
1st Qu.:1.0000 1st Qu.:2.000 1st Qu.:2.0 1st Qu.:0.0400673 1st Qu.:0.5814 1st Qu.:0.0188524
Median :1.0000 Median :5.000 Median :4.5 Median :0.1274580 Median :0.8281 Median :0.1021314
Mean :0.7916 Mean :4.561 Mean :4.5 Mean :0.1184386 Mean :0.7657 Mean :0.1343779
3rd Qu.:1.0000 3rd Qu.:7.000 3rd Qu.:7.0 3rd Qu.:0.1831702 3rd Qu.:0.9700 3rd Qu.:0.2257584
Max. :1.0000 Max. :9.000 Max. :9.0 Max. :0.3526330 Max. :1.0000 Max. :0.4989194
class2 diff diff_sd_ratio logit_prob1
Min. :0.000 Min. :0.0000124 Min. : 0.0 Min. :-1.4929
1st Qu.:2.000 1st Qu.:0.3340403 1st Qu.: 1.7 1st Qu.: 0.3284
Median :4.000 Median :0.7213734 Median : 5.4 Median : 1.5722
Mean :4.082 Mean :0.6312929 Mean : 208.5 Mean : 2.1424
3rd Qu.:6.000 3rd Qu.:0.9503560 3rd Qu.: 23.5 3rd Qu.: 3.4750
Max. :9.000 Max. :0.9999990 Max. :340800.1 Max. :14.3329
For the entire set of classifications, we have the following notable quantities:
The mean uncertainty of the predicted class is 0.1184386 and the median is 0.127458. The minimum value is 2.9342710^{-6}, the maximum is 0.352633. The interquartile range (IQR) is 0.1431029.
The mean softmax output of the predicted class is 0.7656708 and the median is 0.8280935. The minimum value is 0.1834866, the maximum is 0.9999994. The IQR is 0.3885904.
The mean softmax output of the runner-up is 0.1343779 and the median is 0.1021314. The minimum value is 4.111852810^{-7}, the maximum is 0.4989194. The IQR is 0.206906.
The mean difference between the softmax ouputs of the prediction and runner-up is 0.6312929 and the median is 0.7213734. The minimum value is 1.242756810^{-5}, the maximum is 0.999999. The IQR is 0.6163157.
The mean difference to uncertainty ratio, or \(\tau_{jk}\), is 208.5008833 and the median is 5.3560735. The minimum value is 5.325851410^{-5}, the maximum is 3.408001310^{5}. The IQR is 21.8301779.
# Aggregating summary statistics by correct/incorrect
agg_df <- df %>%
group_by(correct) %>%
summarise(n=n(),
mean_prob1=mean(prob1),
mean_prob2=mean(prob2),
mean_uncertainty=mean(uncertainty),
median_uncertainty=median(uncertainty),
sd_uncertainty=sd(uncertainty),
mean_diff=mean(diff),
median_diff=median(diff),
sd_diff=sd(diff),
mean_ratio=mean(diff_sd_ratio),
median_ratio=median(diff_sd_ratio),
sd_ratio=sd(diff_sd_ratio))
agg_df
The model has incorrectly classified 2084 images.
The model has correctly classified 7916 images.
# Summarizing incorrect predictions
incorrect_df <- df %>%
filter(correct==0)
summary(incorrect_df)
correct prediction truth uncertainty prob1 prob2 class2
Min. :0 Min. :0.000 Min. :0.000 Min. :0.001013 Min. :0.1835 Min. :0.000185 Min. :0.000
1st Qu.:0 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:0.153040 1st Qu.:0.4103 1st Qu.:0.173676 1st Qu.:2.000
Median :0 Median :4.000 Median :4.000 Median :0.181794 Median :0.5264 Median :0.239908 Median :4.000
Mean :0 Mean :4.342 Mean :4.049 Mean :0.180244 Mean :0.5451 Mean :0.242866 Mean :4.038
3rd Qu.:0 3rd Qu.:6.000 3rd Qu.:5.000 3rd Qu.:0.210907 3rd Qu.:0.6602 3rd Qu.:0.312481 3rd Qu.:6.000
Max. :0 Max. :9.000 Max. :9.000 Max. :0.333903 Max. :0.9994 Max. :0.498919 Max. :9.000
diff diff_sd_ratio logit_prob1
Min. :0.0000124 Min. : 0.0001 Min. :-1.4929
1st Qu.:0.0959377 1st Qu.: 0.5130 1st Qu.:-0.3629
Median :0.2403054 Median : 1.2146 Median : 0.1055
Mean :0.3022316 Mean : 2.8113 Mean : 0.2501
3rd Qu.:0.4645013 3rd Qu.: 2.4979 3rd Qu.: 0.6642
Max. :0.9992357 Max. :986.4482 Max. : 7.4530
For the entire set of incorrect classifications, we have the following notable quantities:
The mean uncertainty of the predicted class is 0.1802444 and the median is 0.1817945. The minimum value is 0.001013, the maximum is 0.333903. The interquartile range (IQR) is 0.057867.
The mean softmax output of the predicted class is 0.545098 and the median is 0.5263628. The minimum value is 0.1834866, the maximum is 0.9994206. The IQR is 0.2499404.
The mean softmax output of the runner-up is 0.2428664 and the median is 0.2399082. The minimum value is 1.84973710^{-4}, the maximum is 0.4989194. The IQR is 0.1388051.
The mean difference between the softmax ouputs of the prediction and runner-up is 0.3022316 and the median is 0.2403054. The minimum value is 1.242756810^{-5}, the maximum is 0.9992357. The IQR is 0.3685635.
The mean difference to uncertainty ratio, or \(\tau_{jk}\), is 2.8112788 and the median is 1.2145691. The minimum value is 5.325851410^{-5}, the maximum is 986.4482162. The IQR is 1.9848617.
# Summarizing correct predictions
correct_df <- df %>%
filter(correct==1)
summary(correct_df)
correct prediction truth uncertainty prob1 prob2
Min. :1 Min. :0.000 Min. :0.000 Min. :0.0000029 Min. :0.1863 Min. :0.0000004
1st Qu.:1 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:0.0263485 1st Qu.:0.7009 1st Qu.:0.0114910
Median :1 Median :5.000 Median :5.000 Median :0.0965002 Median :0.9002 Median :0.0618456
Mean :1 Mean :4.619 Mean :4.619 Mean :0.1021673 Mean :0.8237 Mean :0.1058168
3rd Qu.:1 3rd Qu.:7.000 3rd Qu.:7.000 3rd Qu.:0.1672548 3rd Qu.:0.9820 3rd Qu.:0.1758222
Max. :1 Max. :9.000 Max. :9.000 Max. :0.3526330 Max. :1.0000 Max. :0.4886857
class2 diff diff_sd_ratio logit_prob1
Min. :0.000 Min. :0.0007845 Min. : 0.0 Min. :-1.4742
1st Qu.:2.000 1st Qu.:0.5144494 1st Qu.: 2.8 1st Qu.: 0.8516
Median :4.000 Median :0.8352473 Median : 8.6 Median : 2.1997
Mean :4.094 Mean :0.7179230 Mean : 262.7 Mean : 2.6406
3rd Qu.:7.000 3rd Qu.:0.9697941 3rd Qu.: 36.9 3rd Qu.: 3.9965
Max. :9.000 Max. :0.9999990 Max. :340800.1 Max. :14.3329
For the entire set of correct classifications, we have the following notable quantities:
The mean uncertainty of the predicted class is 0.1021673 and the median is 0.0965002. The minimum value is 2.9342710^{-6}, the maximum is 0.352633. The interquartile range (IQR) is 0.1409062.
The mean softmax output of the predicted class is 0.8237397 and the median is 0.9002195. The minimum value is 0.1863078, the maximum is 0.9999994. The IQR is 0.2810591.
The mean softmax output of the runner-up is 0.1058168 and the median is 0.0618456. The minimum value is 4.111852810^{-7}, the maximum is 0.4886857. The IQR is 0.1643312.
The mean difference between the softmax ouputs of the prediction and runner-up is 0.717923 and the median is 0.8352473. The minimum value is 7.844567310^{-4}, the maximum is 0.999999. The IQR is 0.4553448.
The mean difference to uncertainty ratio, or \(\tau_{jk}\), is 262.6516078 and the median is 8.5538778. The minimum value is 0.0029356, the maximum is 3.408001310^{5}. The IQR is 34.0877437.
In the following we will visualize the relationships between our variabels. We start by examining the empirical distribution of the uncertainty estimates \(\hat{\sigma}_k\).
The distribution appears to be bimodal, with peaks close to 0 and 0.2:
# Distribution of estimated uncertainty
p1 <- df %>%
ggplot(aes(x=uncertainty)) +
geom_histogram(col="grey", bins = 50, alpha=.5) +
ggtitle("Distribution of estimated uncertainty")
p1
By grouping the uncertainty estimates by correct (i.e. if the label was correctly predicted or not), we can find out how the predictions contribute to the uncertainty distribution.
The blue line corresponds to the correct predictions, the red line corresponds to incorrect predictions. We see that the incorrect predictions are centered around a higher associated uncertainty, whereas far more of the correctly predicted classes are concentrated around a low uncertainty value. The incorrect classifications greatly contribute to the bimodality, but it is also present in the distribution of uncertainty for the correct classifications.
#Distribution of estimated uncertainty by prediction
p2 <- df %>%
ggplot(aes(x=uncertainty, col=factor(correct))) +
geom_freqpoly(alpha=.7) +
ggtitle("Distribution of estimated uncertainty by classification") +
scale_color_discrete(name="Prediction",
breaks=c("0", "1"),
labels=c("0: Incorrect", "1: Correct"))
p2
The following kernel density estimate plot (using a Gaussian kernel) gives us an idea of how the distributions compare to eachother:
# KDE by correct prediction
p3 <- df %>%
ggplot(aes(x=uncertainty)) +
geom_density(data=incorrect_df, fill="red", alpha=I(.2)) +
geom_density(data=correct_df, fill="turquoise", alpha=I(.2)) +
ggtitle("Kernel density estimates of uncertainty distribution")
p3
The boxplot gives us yet another way to visualize the difference between uncertainty distributions by predictive ability. It is interesting to note the amount of outliers for the incorrect predictions, indicating som negative skew. It is wrong, but uncertainty is low.
# Boxplot of uncertainties for correct vs. incorrect
p4 <- df %>%
ggplot(aes(x=factor(correct), y=uncertainty)) +
geom_boxplot(aes(fill=factor(correct)), alpha=.7) +
labs(x="correct") +
ggtitle("Boxplot of uncertainty distribution by correct/incorrect") +
scale_fill_discrete(name="Prediction",
breaks=c("0", "1"),
labels=c("0: Incorrect", "1: Correct"))
p4
We may also be interested in the relationship between uncertainty and other variables. First, we plot uncertainty against prob1 (the prediction’s softmax output). The softmax outputs in the above plot are colour graded. Outputs close to 1 are red, outputs close to 0 are blue. The plot shows a clear parabolic shape:
# Plotting softmax output of prediction against estimated uncertainty
p5 <- df %>%
ggplot(aes(x=prob1, y=uncertainty)) +
geom_point(alpha=.2) +
ggtitle("Uncertainty vs. softmax output of prediction for all observations") +
labs(x="softmax output of predicted class") +
scale_color_distiller(name="Softmax output",
palette = "Spectral")
p5
p6 <- df %>%
mutate(correct=as.logical(correct)) %>%
ggplot(aes(x=prob1, y=uncertainty)) +
geom_point(aes(fill=correct, col=correct), shape=21, alpha=.5) +
ggtitle("Uncertainty vs. softmax output of prediction for all observations") +
labs(x="softmax output of predicted class")
p6
We can obtain more information by colouring the points by the value of the runner-up predictions. This plot is particularly interesting. The softmax outputs of the runner-up classes \(\hat{\mu}_j\) in the above plot are colour graded. \(\hat{\mu}_j \approx 0.5\) are red, \(\hat{\mu}_j \approx 0\) are blue. The round point is the pair mean values of (prob1, prob2) for the incorrect predictions. The triangular point is the pair mean values for the correct predictions.
We see a clear concentration of red points in the area where the probability of the predictied class \(\hat{\mu}_k \approx 0.5\). If the softmax predictions of both the predicted class and the runner-up are close 0.5, then we have a situation analogous to maximum entropy. This points coincide with the largest approximated uncertainty values.
# Plotting softmax output of prediction against estimated uncertainty, coloured by softmax output of runner-up
p7 <- df %>%
ggplot(aes(x=prob1, y=uncertainty)) +
geom_point(aes(col=prob2), shape=21, alpha=.7) +
geom_point(data=agg_df, aes(x=mean_prob1, y=mean_prob2, shape=as.logical(correct)), size=2.5) +
#geom_point(data=agg_df, aes(y=uncertainty, x=mean_prob1, shape=as.logical(correct)), size=2) +
ggtitle("Uncertainty vs. softmax output of prediction for all observations") +
labs(x="softmax output of predicted class", shape="correct") +
scale_color_distiller(name="runner up",
palette = "Spectral")
p7
In the following plot we split the observations by incorrect/correct predictions, and plot the values of uncertainty against the softmax output of the predicted class:
# Plotting uncertainty vs. softmax output of prediction
p8a <- df %>%
ggplot(aes(x=prob1, y=uncertainty)) +
geom_point(aes(col=prob2), shape=21, alpha=.7) +
ggtitle("Uncertainty vs. softmax output of prediction by incorrect/correct") +
labs(x="softmax output of prediction") +
facet_grid(.~factor(correct)) +
scale_color_distiller(name="Runner up",
palette = "Spectral")
p8a
The black contour lines indicate where most of the points are concentrated. The plot on the left is for incorrect predictions. The right hand plot represents the correct predictions. For the correct predictions, it seems as if far more of the points are concentrated around high predicted output/low runner-up output/low uncertainty. This is not surprising, considering 75% of the correct predictions have a softmax value of approximately 0.7 or above. For the incorrect classifications, most of the points are concentrated around the area of maximum entropy. This indicates that the approximated uncertainty estimates indeed contain valuable information in the incorrect cases.
# Plotting uncertainty vs. softmax output of prediction
p8 <- df %>%
ggplot(aes(x=prob1, y=uncertainty)) +
geom_point(aes(col=prob2), shape=21, alpha=.7) +
geom_density_2d(col="black", alpha=.3) +
ggtitle("Uncertainty vs. softmax output of prediction by incorrect/correct") +
labs(x="softmax output of prediction") +
facet_grid(.~factor(correct)) +
scale_color_distiller(name="Runner up",
palette = "Spectral")
p8
The following plot shows the relationship between uncertainty estimates and the softmax output of the runner-up prediction. Unsurprisingly, model uncertainty increases as the softmax output of the runner-up increases. We have plotted a LOESS estimate of the mean uncertainty as a function of the runner-up output to make this clearer:
# Plotting softmax output of runner-up against estimated uncertainty
p10 <- df %>%
ggplot(aes(x=prob2, y=uncertainty)) +
geom_point(alpha=.2) +
geom_smooth(method="loess") +
labs(x="softmax output of runner-up") +
ggtitle("Uncertainty vs. softmax output of runner-up")
p10
The following plots were generated using Python (code available on GitHub). On the left hand side we see the unnormalized image with the corresponding ground truth label. The plot in the middle shows the softmax output of the predicted class for each of the \(T=100\) stochastic forward passes. \(\mu_k\) is given by the solid red line, \(\mu_j\) is given by the dashed brown line. The plot title shows both the predicted class and the runner-up class. To the right is a kernel density estimate (using a Gaussian kernel) of the \(T=100\) softmax outputs for the predicted class. The plots show the top 5 most uncertain classifications in the entire data set.
For some unknown reason, the runner-up prediction for the five most uncertain but correct predictions are all “airplane”. One possible explanation is the presence of large areas of blue and/or white, which we might assume would be present as backgrounds in images of planes flying through the sky.
As mentioned in section 2.2, the connection established between dropout neural networks and GPs are lost when applied to convolutional neural networks. Performing a logistic regression gives us a simple way of testing if the approximated uncertainty is a significant predictor of the model’s ability to predict correctly.
The coefficient associated with uncertainty is highly significant, indicating that the estimates gathered from performing MC dropout are indeed a useful quantification of predictive uncertainty. Given a unit increase in uncertainty, we expect the probability of predicting correctly to decrease.
# Fitting logistic regression model to check significance of uncertainty
model_sd <- glm(factor(correct)~uncertainty, data=df, family = binomial(link="logit"))
summary(model_sd)
Call:
glm(formula = factor(correct) ~ uncertainty, family = binomial(link = "logit"),
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6704 0.2409 0.3588 0.7100 2.0114
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.55249 0.07625 46.59 <2e-16 ***
uncertainty -15.40803 0.43250 -35.63 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10236.6 on 9999 degrees of freedom
Residual deviance: 8467.6 on 9998 degrees of freedom
AIC: 8471.6
Number of Fisher Scoring iterations: 5
The question is then: How do we determine a reasonable uncertainty value for referral to a human expert? As we saw in section 3.1.2, the mean uncertainty values of the incorrect predictions higher than for the correct predictions (0.1802444 vs. 0.1021673, respectively).
Naively, we could set the threshold for referral to the mean uncertainty of the incorrectly classified images:
# Counting number of correct/incorrect by uncertainty >= .18
referral_mean <- df %>%
filter(uncertainty>=.18) %>%
count(correct)
referral_mean
The problem here is apparent: Many of the correctly classified images are associated with a relatively high level of uncertainty (in our case, this is due to the bimodality of the uncertainty distribution in section 3.2.1). The result of our logistic regression suggests that uncertainty is a significant predictor of a model’s ability to predict correctly.
However, although uncertainty seems to contain valueable information, the relative uncertainties of the correct/incorrect observations (in this case) may be too small to differentiate which images should be referred to an expert. We may need to amplify the quantification of uncertainty in some way.
For the most uncertain incorrectly classified images in section 3.4.1 the runner-up suggestions are non-sensical. Still, is there any information to be obtained from the runner-up predictions? What would happen to our overall accuracy if we used the runner-up predictions for all incorrect classifications?
# Counting classification accuracy if runner-up is equal to ground truth
class2_df <- df %>%
mutate(correct=replace(correct, correct==0 & class2==truth, 1)) %>%
count(correct)
class2_df
Accuracy would rise to 0.9079. This indicates that there may be some valuable information to be gathered from the runner-up predictions.
One possible approach is to use the value of \(\tau_{jk}\) introduced in section 2.5. Recall that \(\tau_{jk}\) incorporates information about the runner-up prediction into our quantification of uncertainty. Consider the following cases:
\(\tau_{kj} \approx 1\) means that diff and uncertainty are relatively similar. This happens if a) the models have failed to reach a consensus (diff is small) but model uncertainty is low, or b) the models have reached a consensus (diff is large) but model uncertainty is high. Let’s call these referral predictions.
\(\tau_{kj} \to 0\) means that uncertainty is much larger than diff. These should represent uncertain predictions.
\(\tau_{kj} \to \infty\) means that uncertainty is much smaller than diff. These should represent non-referral predictions.
As we saw in table 3.1.2, the mean \(\tau_{jk}\) for the incorrectly classified images is 2.8112788 and 262.6516078 for the correctly classified images. The respective medians are 1.2145691 and 8.5538778.
As a first step, setting the referral threshold to \(\tau_{jk} \leq 2.81\) (the mean uncertainty of the incorrect classifications) gives:
# Counting number of correct/incorrect by tau <= 2.8
referral_meantau <- df %>%
filter(diff_sd_ratio <= 2.81) %>%
count(correct)
referral_meantau
We run into the same problem as when we used the mean uncertainty: Many of the correctly predicted images would be referred, in fact more than when we used the uncertainty estimates. It is interesting to note, however, that we get far more referrals of incorrectly classified images.
Upon inspection, the boxplot of the log-transformed \(\tau_{jk}\) clearly shows the presence of extreme values (outliers marked in red) in both tails. For incorrect predictions, the distribution seems to be negatively skewed. The distribution of the correct predictions appear to be positively skewed.
pt <- df %>%
ggplot(aes(x=factor(correct), y=log(diff_sd_ratio))) +
geom_boxplot(aes(fill=factor(correct)), outlier.colour = "red", outlier.alpha = .15) +
xlab("correct") +
ylab("log-transformed tau") +
ggtitle("Presence of extreme uncertainty values") +
labs(fill="correct")
pt
Perhaps a more robust measure of the central tendency of \(\tau_{jk}\) is the median. Setting the referral rate of \(\tau_{jk} \leq 1.2\) gives the following:
# Counting number of correct/incorrect by tau <= 1
referral_medtau <- df %>%
filter(diff_sd_ratio <= 1.2) %>%
count(correct)
referral_medtau
This is a clear improvement over our the mean approach, in terms of the amount of referred images. 607 fewer incorrect images are referred compared to 1172 fewer correct images.
When compared to using the mean of \(\hat{\sigma}_k\) as a cut-off for referral, the median of \(\tau_{jk}\) results in 37 fewer incorrect images are referred compared to 747 fewer correct images. This could indicate that \(\tau_{jk}\) is a useful quantification of uncertainty.
BRAINSTORMING, USING RUNNERS-UP:
# What is accuracy if we use diff_sd_ratio and runner-up as prediction?
ref1 <- df %>%
mutate(correct=replace(correct, diff_sd_ratio<=1.2 & correct==0 & class2==truth, 1)) %>%
count(correct)
ref1
# What is accuracy if we use uncertainty and runner-up as prediction?
ref2 <- df %>%
mutate(correct=replace(correct, uncertainty>=.18 & correct==0 & class2==truth, 1)) %>%
count(correct)
ref2
IDEA: Using LDA to find best tau.
# Fitting logistic regression model check significance of log-transformed diff_sd_ratio
model_ratio <- glm(factor(correct)~log(diff_sd_ratio), data=df, family = binomial(link="logit"))
summary(model_ratio)
Call:
glm(formula = factor(correct) ~ log(diff_sd_ratio), family = binomial(link = "logit"),
data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.5499 0.0773 0.3446 0.6300 3.0568
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.36023 0.03255 11.07 <2e-16 ***
log(diff_sd_ratio) 0.86145 0.02230 38.63 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10236.6 on 9999 degrees of freedom
Residual deviance: 7612.8 on 9998 degrees of freedom
AIC: 7616.8
Number of Fisher Scoring iterations: 6
# Checking for interactions between uncertainty and log-transformed diff_sd_ratio
model_inter <- glm(factor(correct)~uncertainty*log(diff_sd_ratio), data=df, family = binomial(link="logit"))
summary(model_inter)
Call:
glm(formula = factor(correct) ~ uncertainty * log(diff_sd_ratio),
family = binomial(link = "logit"), data = df)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.5057 0.0137 0.2455 0.7045 2.8495
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.24145 0.15900 -1.519 0.12887
uncertainty 2.81600 0.80255 3.509 0.00045 ***
log(diff_sd_ratio) 1.51152 0.07259 20.823 < 2e-16 ***
uncertainty:log(diff_sd_ratio) -4.49845 0.38959 -11.547 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10236.6 on 9999 degrees of freedom
Residual deviance: 7454.4 on 9996 degrees of freedom
AIC: 7462.4
Number of Fisher Scoring iterations: 7
# Checking models
AIC(model_sd, model_ratio, model_inter)
We gather the data from all models in a single dataframe df:
Note that we have added two new variables to the data set: kernel (factor, convolution size) and dropout (factor, dropout rate). In the following chunk we generate some summary statistics:
Overall:
model52 is the most certain.
model35 is the least certain.
It seems as though models trained with a higher dropout rate are more accurate, but less certain.
The mean differences between the predicted and runner-up class are close to 0.6 for all models.
The models with the lowest mean uncertainty correspond with large mean values of \(\tau_{jk}\).
model52: Accuracy has increased by 0.01514 percent.model55: Accuracy has increased by 0.03023 percent.model32: Accuracy has increased by 0.0167 percent.model35: Accuracy has increased by 0.03342 percent.[1] Gal, Y. & Ghahramani, Z. Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference. arXiv: 1506.02158 (2015).
[2] Gal, Y. & Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. arXiv: 1506.02142 (2015).
[3] Gal, Y. & Ghahramani, Z. Dropout as a Bayesian Approximation: Appendix. arXiv: 1506.02157 (2015).
[4] Leibig, C. & Allken, V. et. al. Leveraging uncertainty information from deep neural networks for disease detection. Scientific Reports volume 7, Article number: 17816 (2017).
[5] Hinton, G. & Srivastava, N. et. al. Improving neural networks by preventing co-adaptation of feature detectors. arXiv: 1207.0580 (2012).
[6] Leibig, C. & Allken, V. et. al. Leveraging uncertainty information from deep neural networks for disease detection: supplementary material. Scientific Reports volume 7, Article number: 17816 (2017).